
feat(cloud): Hetzner control plane IaC + data plane naming + legacy milady-core deprecation#7890

Merged
lalalune merged 3 commits into develop from feat/cloud-iac-hetzner on May 22, 2026

Conversation

@standujar
Collaborator

@standujar standujar commented May 22, 2026

Summary

Four coordinated changes that move the Hetzner setup from "manually-poked VMs" to a proper two-tier architecture:

  1. Terraform IaC for the control plane: declares the persistent VM(s) that host the orchestrator daemon (provisioning-worker, agent-router, headscale, cloudflared). Uses hetznercloud/hcloud + cloudflare providers, Cloudflare R2 as S3 state backend. Includes cloud-init bootstrap, tfvars examples, and a README walkthrough for terraform import of the existing prod VM (89.167.63.246).

  2. Data-plane naming: node-<hex> becomes eliza-core-<hex> going forward. generateNodeId() now sources entropy from crypto.getRandomValues() instead of Math.random().toString(16).slice(2, 10) — the latter silently strips trailing zeros and could produce short or colliding suffixes when node_id is UNIQUE in docker_nodes.

  3. Data-plane location default fixed: Hetzner deprecated cpx32 on ash (Ashburn), so defaultHcloudLocation = "ash" failed with "unsupported location for server type". Flipped to fsn1 to match the actual prod fleet.

  4. Migration 0132: disables the 6 legacy milady-core-* rows (enabled=false, capacity=8). They were inserted by hand in 2026-03 with capacity=100 (unrealistic for cpx32), have been health-check offline for weeks, and are now ignored by the autoscaler. Existing sandboxes keep running on the underlying Docker daemons until their next user-triggered restart, at which point the daemon provisions a replacement on a fresh autoscaled core. Ops follow-up (delete Hetzner servers + DB rows) is documented in ARCHITECTURE.md.
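
To make item 4 concrete, here is a minimal sketch of how a scheduler could skip disabled rows. The `DockerNode` shape, the `allocated` field, and `schedulableNodes()` are hypothetical names for illustration; the real `docker_nodes` schema and autoscaler logic are not shown in this thread.

```typescript
// Hypothetical row shape: only node_id, enabled and capacity are named in the PR.
interface DockerNode {
  nodeId: string;
  enabled: boolean;
  capacity: number;
  allocated: number; // assumed counter of sandboxes currently placed on the node
}

// Keep only nodes that are enabled and still have free slots.
function schedulableNodes(nodes: DockerNode[]): DockerNode[] {
  return nodes.filter((n) => n.enabled && n.allocated < n.capacity);
}

const nodes: DockerNode[] = [
  // Legacy row after migration 0132: disabled, capacity lowered to 8.
  { nodeId: "milady-core-1", enabled: false, capacity: 8, allocated: 3 },
  // Fresh autoscaled core using the new naming scheme.
  { nodeId: "eliza-core-0a1b2c3d", enabled: true, capacity: 8, allocated: 2 },
];

console.log(schedulableNodes(nodes).map((n) => n.nodeId)); // ["eliza-core-0a1b2c3d"]
```

In this sketch the sandboxes already counted on `milady-core-1` are left alone; the row simply stops receiving new placements, which matches the "drain on next restart" behaviour described above.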

packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md formalises the two-tier model (static control plane vs elastic data plane) so future ops actions have a clear runbook.

What this PR does NOT do (followups)

  • Terraform for headscale state + the cloudflared tunnel
  • terraform-apply GitHub workflow
  • Migrating the 4 remaining cron paths (pool-replenish, pool-health-check, pool-image-rollout, deployment-monitor) off the orphan container-control-plane service onto the daemon-queue pattern, then retiring the service entirely
  • Raising the Hetzner Cloud server-count limit (ops ticket)

Test plan

  • bun test packages/cloud-shared/src/lib/services/containers/node-autoscaler.test.ts — 5/5 pass (3 existing + 2 new for the generateNodeId() rename and entropy fix)
  • bunx tsc --noEmit on packages/cloud-shared — clean (pre-existing core/shared noise unrelated)
  • terraform init -backend=false && terraform validate on packages/cloud-infra/cloud/terraform/hetzner/control-plane/ — success
  • terraform fmt -recursive -check — clean
  • R2 backend bucket eliza-terraform-state exists in WEUR (verified)
  • After merge: import the existing prod VM into Terraform state via the steps in the README
  • After merge + Hetzner limit raise: verify a fresh eliza-core-<hex> provisions via autoscale and milady-core-* sandboxes drain naturally on restart

Ran the /clean skill on this PR

  • Agent 1 (Slop & Larp): inlined a trivial modules/manager-vm/ wrapper, removed a dead-exported helper (-87 LOC)
  • Agent 2 (Types & Structure): clean, surfaced the ash vs fsn1 inconsistency — fixed in this PR
  • Agent 3 (Dead Code & Legacy): no leftovers
  • Agent 4 (Defensive → Clean): caught the Math.random() slicing bug, replaced with crypto.getRandomValues
  • Agent 5 (Tests & DRY): added the 2 missing tests for generateNodeId()

Out-of-band ops actions needed for production rollout

  1. Generate R2 API token (already done — bucket eliza-terraform-state exists)
  2. Set HETZNER_CLOUD_API_KEY + CONTAINERS_AUTOSCALE_PUBLIC_SSH_KEY on the daemon (already done on staging VM at 89.167.63.246)
  3. Open Hetzner ticket to raise the server-count limit past 10 so autoscale can actually create replacement cores

Greptile Summary

This PR transitions the Hetzner setup to a two-tier architecture with Terraform IaC for the static control plane, fixes the defaultHcloudLocation fallback from the deprecated ash to fsn1, replaces Math.random()-based node ID generation with crypto.getRandomValues(), renames node IDs from node-<hex> to eliza-core-<hex>, and disables the six legacy milady-core-* DB rows via migration 0132.

  • Terraform control plane: New hetzner/control-plane/ module declares Hetzner VMs + Cloudflare DNS records backed by Cloudflare R2 state; cloud-init template handles first-boot setup of Docker, Bun, and the deploy user.
  • generateNodeId() fix: 4 bytes from crypto.getRandomValues() hex-encoded with padStart guarantees exactly 8 hex characters, preventing the trailing-zero truncation bug in the previous Math.random().toString(16).slice(2,10) path.
  • Migration 0132: Flips all milady-core-* rows to enabled=false, capacity=8 so the autoscaler ignores them while live sandboxes drain naturally on their next restart.

Confidence Score: 3/5

The TypeScript and migration changes are safe to merge, but the Terraform module has two gaps that would prevent a usable deployment: no guard against an empty SSH key list (produces a permanently inaccessible VM) and no authorized_keys injection for the deploy user (blocks the GitHub Actions deploy workflow).

The node-autoscaler entropy fix and the milady-core migration are clean and well-tested. The defaultHcloudLocation fix is a straightforward one-liner. The risk sits entirely in the new Terraform module: applying with the default empty ssh_public_keys creates a VM nobody can access, and the deploy user created by cloud-init has no SSH authorized keys so the expected deploy workflow cannot SSH in.

variables.tf (missing SSH key validation) and cloud-init/bootstrap.yaml.tftpl (missing ssh_authorized_keys for the deploy user) need attention before the module is run against any environment.

Security Review

  • Unverified installer scripts in cloud-init (bootstrap.yaml.tftpl lines 49–53): both Docker (curl -fsSL https://get.docker.com | sh) and Bun (curl -fsSL https://bun.sh/install | bash) are fetched and executed without checksum verification. The control-plane VM holds DATABASE_URL, HCLOUD_TOKEN, Headscale state, and the cloudflared tunnel — a supply-chain or MITM attack at bootstrap time would silently compromise the entire control plane.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/cloud-infra/cloud/terraform/hetzner/control-plane/variables.tf | Defines all Terraform input variables; ssh_public_keys defaults to [] with no minimum-length validation, allowing a zero-key apply that produces an inaccessible VM. |
| packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl | Cloud-init bootstrap template: creates deploy user but injects no SSH authorized keys for that user, blocking the GitHub Actions deploy workflow. Uses unverified curl |
| packages/cloud-infra/cloud/terraform/hetzner/control-plane/main.tf | Core Terraform resources for control-plane VMs and Cloudflare DNS records. SSH key resources use positional list indexing which causes unnecessary plan churn on key reorder. |
| packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts | Replaces Math.random().toString(16).slice(2,10) with crypto.getRandomValues() over 4 bytes; renames prefix from node- to eliza-core-. Correct and well-tested. |
| packages/cloud-shared/src/lib/services/containers/node-autoscaler.test.ts | Adds two new tests: validates eliza-core-[0-9a-f]{8} format and uniqueness across 50 consecutive provisions. Both correct. |
| packages/cloud-shared/src/db/migrations/0132_legacy_milady_cores_disable.sql | Sets enabled=false, capacity=8 for all milady-core-* rows. SQL correct; WHERE clause appropriately scoped; journal idx:131 matches the 0-based convention. |
| packages/cloud-shared/src/lib/config/containers-env.ts | Changes defaultHcloudLocation fallback from ash to fsn1, fixing provisioning failures after Hetzner deprecated cpx32 on ash. |

Comments Outside Diff (1)

  1. packages/cloud-infra/cloud/terraform/hetzner/control-plane/variables.tf, line 536-540 (link)

    P1 No SSH key validation — VM becomes inaccessible on terraform apply

    var.ssh_public_keys defaults to [], and there is no validation block requiring at least one entry. When Hetzner creates a server, it injects the listed SSH public keys into root's ~/.ssh/authorized_keys. With an empty list, the VM boots with no authorized key for root, and the deploy user also has no keys (see the cloud-init template). The only recovery is Hetzner rescue mode.

Reviews (1): Last reviewed commit: "feat(cloud): Hetzner control plane IaC +..."

Greptile also left 3 inline comments on this PR.

…ilady-core deprecation

Four coordinated pieces:

1. Terraform module `packages/cloud-infra/cloud/terraform/hetzner/control-plane/`
   declares the persistent VM(s) that host the orchestrator daemon
   (provisioning-worker, agent-router, headscale, cloudflared). Uses
   hetznercloud/hcloud + cloudflare providers, Cloudflare R2 as S3 state
   backend. Includes a cloud-init bootstrap template, tfvars examples,
   and a README walkthrough for both new-host bootstrap and `terraform
   import` of the existing prod VM (89.167.63.246) into state.

2. Data-plane naming: `node-<hex>` becomes `eliza-core-<hex>` going
   forward. `generateNodeId()` now sources entropy from
   `crypto.getRandomValues()` instead of `Math.random().toString(16)`,
   which silently strips trailing zeros and could produce short or
   colliding suffixes when `node_id` is UNIQUE in `docker_nodes`.

3. Data-plane location default fixed: Hetzner deprecated cpx32 on
   `ash` (Ashburn), so the previous `defaultHcloudLocation = "ash"`
   default fails with "unsupported location for server type". Flipped
   to `fsn1` to match the actual prod fleet.

4. Migration 0132 disables the 6 legacy `milady-core-*` rows
   (`enabled=false`, `capacity=8`). They were inserted by hand in 2026-03
   with `capacity=100` (unrealistic for cpx32), have been health-check
   offline for weeks, and are now ignored by the autoscaler. Existing
   sandboxes keep running on the underlying Docker daemons until their
   next user-triggered restart, at which point the daemon provisions a
   replacement on a fresh autoscaled core. Ops follow-up (delete Hetzner
   servers + DB rows) is documented in the architecture markdown.

ARCHITECTURE.md formalises the two-tier model (static control plane vs
elastic data plane) so future ops actions have a clear runbook.

Followups (separate PRs): Terraform modules for headscale state + the
cloudflared tunnel; terraform-apply GitHub workflow; migrating the 4
remaining cron paths off the orphan container-control-plane service onto
the daemon-queue pattern; raising the Hetzner Cloud server-count limit.

Tests:
- 2 new sociable tests for generateNodeId() asserting the prefix +
  exactly 8 lowercase hex chars + uniqueness across 50 calls. All 5
  node-autoscaler tests pass.

Out-of-band ops actions needed before merging to production:
- Generate R2 API token + create the bucket entry (already done:
  eliza-terraform-state in WEUR)
- Set environment secrets used by the daemon: HETZNER_CLOUD_API_KEY,
  CONTAINERS_AUTOSCALE_PUBLIC_SSH_KEY (already done on staging VM)
- Open Hetzner ticket to raise server-count limit past 10 so autoscale
  can actually create replacement cores
@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@claude
Contributor

claude Bot commented May 22, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Comment on lines +20 to +26
users:
  - name: deploy
    groups: sudo, docker
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    lock_passwd: true

P1 deploy user has no SSH authorized keys — GitHub Actions deploy workflow will be unable to connect

Hetzner's SSH key injection only populates root's ~/.ssh/authorized_keys. The deploy user is created with lock_passwd: true and no ssh_authorized_keys entry, making it unreachable via SSH. The README's deploy step triggers deploy-eliza-provisioning-worker.yml which presumably SSHes into this user — that will fail until keys are injected out-of-band.

Comment on lines +49 to +53
- curl -fsSL https://get.docker.com | sh
- systemctl enable --now docker

# Bun runtime for the deploy user (the daemons run under bun/tsx).
- su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"'

P2 security Unverified curl | sh installs on the control-plane VM

Both Docker and Bun installs pipe remote scripts into a shell without checksum verification. This VM holds DATABASE_URL, HCLOUD_TOKEN, Headscale state, and the cloudflared tunnel — a higher-value target than a data-plane node. A supply-chain or MITM attack at bootstrap time would silently compromise the entire control plane.

}

resource "hcloud_ssh_key" "operators" {
  for_each = { for idx, key in var.ssh_public_keys : idx => key }

P2 Positional list indexing causes unnecessary key churn on reorder/insert

{ for idx, key in var.ssh_public_keys : idx => key } maps list position to the Hetzner SSH key resource address. Inserting a key before the last position shifts every subsequent key's each.key, causing Terraform to plan renames or destroy+recreates of downstream SSH key objects.
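
The churn can be modelled outside Terraform. The sketch below treats a resource address as a map key and counts how many existing addresses end up pointing at a different SSH key after one operator is prepended. The helper names, the sample keys, and the 8-character hash prefix are assumptions for illustration only.

```typescript
import { createHash } from "node:crypto";

// Address each key by list position, mirroring `idx => key`.
function keyByIndex(keys: string[]): Map<string, string> {
  return new Map(keys.map((key, idx): [string, string] => [String(idx), key]));
}

// Address each key by a short SHA-256 prefix of its content (assumed length 8).
function keyByHash(keys: string[]): Map<string, string> {
  return new Map(
    keys.map((key): [string, string] => [
      createHash("sha256").update(key).digest("hex").slice(0, 8),
      key,
    ]),
  );
}

// Count existing addresses whose value changed or disappeared.
function changedAddresses(before: Map<string, string>, after: Map<string, string>): number {
  let changed = 0;
  for (const [addr, key] of before) {
    if (after.get(addr) !== key) changed++;
  }
  return changed;
}

// Placeholder strings, not real public keys.
const before = ["ssh-ed25519 AAAA-alice", "ssh-ed25519 AAAA-bob"];
const after = ["ssh-ed25519 AAAA-carol", ...before];

console.log(changedAddresses(keyByIndex(before), keyByIndex(after))); // 2
console.log(changedAddresses(keyByHash(before), keyByHash(after))); // 0
```

With positional addressing both existing entries are disturbed by a single insert; with content-derived addressing only the new entry is added.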

@standujar standujar marked this pull request as draft May 22, 2026 00:50
standujar added 2 commits May 22, 2026 02:54
Three issues raised by Greptile on the initial commit:

P1  deploy user had no SSH authorized_keys, so the auto-deploy
    workflow (which SSHes as `deploy`, not root) would fail until
    keys were copied out-of-band. cloud-init now expands the same
    operator key list into the deploy user via a Terraform-template
    loop, so the user is reachable on first boot.

P2 (sec) Replaced `curl get.docker.com | sh` with the official
    Docker apt repo + GPG-verified keyring (cloud-init handles the
    keyring). Replaced `curl bun.sh/install | bash` with a pinned
    GitHub release download whose SHA-256 is verified against the
    same release's SHASUMS256.txt before extracting.

P2  Keyed hcloud_ssh_key.operators by sha256(key) prefix instead of
    list index, so inserting an operator at the start of
    var.ssh_public_keys no longer cascades into renames/recreates
    of every subsequent SSH key resource.
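
The checksum step from the P2 (sec) fix can be sketched as below. This is not the cloud-init code from the PR, which is not shown here; it assumes SHASUMS256.txt uses the usual `<sha256-hex>  <filename>` line format, and the artifact name and helper functions are illustrative.

```typescript
import { createHash } from "node:crypto";

// Parse "<sha256-hex>  <filename>" lines into a filename -> digest map.
function parseShasums(text: string): Map<string, string> {
  const sums = new Map<string, string>();
  for (const line of text.split("\n")) {
    const [sum, file] = line.trim().split(/\s+/);
    if (sum && file && /^[0-9a-f]{64}$/.test(sum)) {
      // sha256sum marks binary-mode entries with a leading "*".
      sums.set(file.replace(/^\*/, ""), sum);
    }
  }
  return sums;
}

// Throw unless the artifact's SHA-256 matches the listed digest.
function verifyArtifact(name: string, data: Uint8Array, shasums: string): void {
  const expected = parseShasums(shasums).get(name);
  if (!expected) throw new Error(`no checksum listed for ${name}`);
  const actual = createHash("sha256").update(data).digest("hex");
  if (actual !== expected) throw new Error(`checksum mismatch for ${name}`);
}

// Stand-in payload and checksum file; a real run would download both.
const payload = Buffer.from("pretend this is a release archive");
const digest = createHash("sha256").update(payload).digest("hex");
const shasums = `${digest}  bun-linux-x64.zip\n`;

verifyArtifact("bun-linux-x64.zip", payload, shasums); // returns without throwing
```

Note that verifying against a checksum file fetched from the same release guards against a corrupted or swapped archive, but not against a compromised release as a whole.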
…za-<n>

The shorter prefix matches the data-plane convention (eliza-core-<hex>)
and supports the in-place rename of the legacy prod VM (milady → eliza-1)
via Hetzner's PUT /servers/{id}. Environment moves to a label
(`environment = production|staging`) so the Hetzner Console can filter
without bloating every SSH command.

Also drops the inline import walkthrough from the README — it's a
one-shot adoption op that lives in operator scratch space, not in repo
docs that drift over time.
@lalalune lalalune merged commit afc8d84 into develop May 22, 2026
30 of 40 checks passed
@lalalune lalalune deleted the feat/cloud-iac-hetzner branch May 22, 2026 17:14